Run Data Profiling

When you enable DQLabs in erwin Data Intelligence (erwin DI), DQLabs APIs pull environment connection information from erwin DI and create catalogs in DQLabs. You can add datasets to the catalogs, run data profiling, and enable drift alerts.

Once the data quality analysis is complete, DQLabs displays the results for catalogs (environments), datasets (tables), and attributes (columns). You can sync the analysis results to erwin DI and view them in the Metadata Manager.

Data quality analysis is available for environments using Oracle, Salesforce, Snowflake, MySQL, MSSQL, Hadoop, and PostgreSQL database types.

Before running data profiling, ensure that the following prerequisites are met:

  • Configure DQLabs in erwin DI. To configure DQLabs, refer to the Configuring DQLabs topic.
  • Switch the Enable DQ Sync option on for the environment. To enable this option, refer to the Managing Environments topic.
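
The prerequisites and the supported database types above can be summarized as a simple readiness check. The following is a minimal sketch only: the Environment record, its field names, and the profiling_blockers function are hypothetical and are not part of any erwin DI or DQLabs API. It only restates the conditions listed in this topic.

```python
from dataclasses import dataclass
from typing import List

# Database types listed in this topic as supported for data quality analysis.
SUPPORTED_DB_TYPES = {
    "oracle", "salesforce", "snowflake", "mysql",
    "mssql", "hadoop", "postgresql",
}


@dataclass
class Environment:
    """Hypothetical record describing an erwin DI environment (illustration only)."""
    name: str
    db_type: str
    dqlabs_configured: bool  # DQLabs is configured in erwin DI
    dq_sync_enabled: bool    # Enable DQ Sync is switched on for the environment


def profiling_blockers(env: Environment) -> List[str]:
    """Return the reasons why data profiling cannot be run yet (empty if none)."""
    blockers = []
    if env.db_type.strip().lower() not in SUPPORTED_DB_TYPES:
        blockers.append(f"Database type '{env.db_type}' is not in the supported list.")
    if not env.dqlabs_configured:
        blockers.append("DQLabs is not configured in erwin DI.")
    if not env.dq_sync_enabled:
        blockers.append("Enable DQ Sync is switched off for the environment.")
    return blockers


if __name__ == "__main__":
    # Example: a Snowflake environment with both prerequisites met.
    env = Environment("Sales", "Snowflake", True, True)
    print(profiling_blockers(env))
```

An empty list means the environment meets the conditions described in this topic. Any returned message points to the prerequisite that still needs attention.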

This topic walks you through adding datasets, enabling drift alerts, and running data profiling on a catalog, using a Snowflake database as an example. You can run data profiling for other database types in the same way.

To run data profiling on a catalog, follow these steps:

  1. Go to Application Menu > Data Quality.
    Your DQLabs instance opens. Log in to DQLabs if prompted.
  2. On the DQLabs menu, click .
    The Catalog page appears. This displays catalogs (environments) that have the Enable DQ Sync option switched on in erwin DI.
  3. Select a catalog.
    The catalog overview page appears.

  4. Click the Datasets tab.
  5. Click, to add datasets.
    The catalog configuration page appears. This page displays environment connection information and datasets in the catalog.
  6. Enter the environment connection details, and then click Validate.
    The catalog configuration page displays the datasets in the catalog.

    On the catalog configuration page, you can use the following options:
    • Pull and Push Only: Switch this option on to perform only pull and push when running data profiling.
    • Properties: Use this option under the Actions column to load datasets based on a recent number of days or a percentage of data.
  7. Select datasets, and click Connect.
    Alternatively, select the check box next to the Datasets column to add all the datasets in a catalog.
  8. Wait for the data quality analysis to complete. It might take some time, depending on the size of the data in the catalog. You can view the status on the Execution Logs tab.

After profiling data, the data quality analysis for the catalog, datasets, and attributes is displayed. You can enable drift alerts for attributes available on datasets. To enable drift alerts, refer to the Enabling Drift Alerts topic.
For example, the following screenshot displays the DQ Score and Impact Score for a catalog. You can further drill down to DQ Scores for attributes.

You can sync these results to erwin DI and view them in the Metadata Manager. To sync data quality analysis results, you need to schedule a sync job. For more information, refer to the Scheduling Jobs topic.

You can further work on datasets. On the Datasets tab, under the Actions column, use the following options for a dataset:

  • Edit: Use this option to view dataset information and edit it.
  • Schedule Profile: Use this option to schedule a data profiling job at predefined intervals.
  • Run Now: Use this option to run the data profiling job right away. This overrides the scheduled jobs.
  • Expand: Use this option to expand and view data quality analysis for attributes in a dataset. You can click the DQ Score for an attribute to view detailed analysis. The Profile tab for the attribute appears. For example, the following screenshot displays an attribute's data quality analysis in tabular format and charts. For more information on charts, refer to the Data Quality Charts topic in the DQLabs user guide.